Regular Expressions in Java
By Chip Jones, OCI Senior Software Engineer
October 2001
Introduction
Regular expressions (regexes or REs) are strings that describe patterns of characters in other strings. They are effective tools for searching and manipulating text and are built-in features of Perl, Python, and many other scripting languages. They are a ubiquitous aspect of programming and computer science, but they have been absent from the standard Java API until now.
This article discusses the java.util.regex package, which is a new feature of J2SE 1.4.
The J2SE 1.4 Regex API
Package java.util.regex provides regular expression-based pattern matching. It contains a Pattern class and a Matcher class.
Matcher instances use Pattern instances to find, replace, and otherwise compare a regex to an input character sequence.
Patterns
The Pattern class is a container for a compiled regex. A Pattern is instantiated by "compiling" an expression:
Pattern pat = Pattern.compile("x*yz*");
An expression can also be compiled and tested in a single step with the static matches() method, which returns the result of the match rather than a Pattern:
boolean isMatch = Pattern.matches("x*yz*", "xxxyzzz");
Both examples use a pattern that matches any number of 'x' characters (including none) followed by a single 'y' character and then by any number of 'z' characters. The strings "y", "xxy" and "xyz" are valid matches for the pattern.
The matches() method in the second example is a shortcut that compiles a pattern and matches it in a single call. For one-time tests of a pattern, the matches() method of the Pattern class eliminates the need to instantiate a Matcher and call its matches() method. However, because it does not allow a compiled expression to be reused, it is less efficient if the match is repeated several times.
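The trade-off between the two forms can be sketched as follows. This is a minimal illustration, not code from the sample application; the class name ReusePatternSketch and its helper method are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// ReusePatternSketch is a hypothetical name used only for this illustration.
public class ReusePatternSketch {
    // Compiled once and reused for every input.
    private static final Pattern XYZ = Pattern.compile("x*yz*");

    // Counts the inputs that match the whole pattern.
    static int countMatches(String[] inputs) {
        int count = 0;
        for (int i = 0; i < inputs.length; i++) {
            Matcher m = XYZ.matcher(inputs[i]);
            if (m.matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] inputs = { "y", "xxy", "xyz", "xxz", "yy" };
        System.out.println(countMatches(inputs)); // 3

        // One-off convenience form; the expression is compiled on each call.
        System.out.println(Pattern.matches("x*yz*", "xxxyzzz")); // true
    }
}
```

The inputs "xxz" and "yy" fail because matches() requires the entire string to fit the pattern: the first has no 'y', and the second has one 'y' too many.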
The regex grammar recognized by the Pattern class is similar in many respects to Perl's. The table below shows some of its common expressions.
Expression | Matches |
---|---|
. | any character |
x | character x |
x? | zero or one of character x |
x* | zero or more of character x |
\t \n \r \f | tab, line feed, carriage-return, form-feed |
\d \D | digit, non-digit |
\s \S | whitespace, non-whitespace |
\w \W | word, non-word |
[a-z] | the lowercase characters a through z inclusively |
[A-Z] | the uppercase characters A through Z inclusively |
^ | the beginning of a line |
$ | the end of a line |
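A few of these expressions can be exercised with a short program. This is a sketch under the assumption that the expressions are written as Java string literals, where each backslash must be doubled; the class name GrammarSketch and its helper methods are hypothetical.

```java
import java.util.regex.Pattern;

// GrammarSketch is a hypothetical name used only for this illustration.
public class GrammarSketch {
    // True if the whole input matches the expression.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the expression is found anywhere in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(whole("colou?r", "color"));              // true: 'u' is optional
        System.out.println(whole("x*", ""));                        // true: zero occurrences allowed
        System.out.println(whole("\\d\\d\\s\\w*", "42 apples"));    // true
        System.out.println(whole("[A-Z][a-z]*", "Java"));           // true
        System.out.println(whole(".at", "cat"));                    // true
        System.out.println(found("^b", 0, "a\nb"));                 // false
        System.out.println(found("^b", Pattern.MULTILINE, "a\nb")); // true
    }
}
```

The last two lines show that, by default, ^ and $ anchor to the start and end of the whole input. They match at the boundaries of individual lines only when the pattern is compiled with Pattern.MULTILINE.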
Matchers
A Matcher object is instantiated by invoking the matcher() method of a Pattern instance.
Matchers are used to search or manipulate a specified string.
Suppose a programmer needs to swap one substring for another. He might choose to deconstruct the string either with a StringTokenizer or with methods in the String class. He would have to write logic to disassemble the string, swap the tokens and reassemble the string.
Alternatively, in C fashion, he could "walk" the string a character at a time and make the appropriate character substitutions. He would have to look ahead and account for shifting characters if the original substring and its replacement were not the same length.
In contrast to this complexity is the ease with which Matcher objects can replace one substring with another.
- // replace 'are' with 'is'
- Pattern pat = Pattern.compile("are");
- Matcher mat = pat.matcher("Java are fun.");
- String sentence = mat.replaceAll("is"); // 'Java is fun.'
In addition to substitution, the Matcher class provides methods that find the next matching substring, test complete and partial string matches, and return matched substrings.
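These other Matcher methods can be sketched briefly: find() steps through successive matches, group() returns the text of the most recent match, matches() tests the entire string, and lookingAt() tests only a prefix. The class name MatcherSketch and the findAll() helper are hypothetical names chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// MatcherSketch is a hypothetical name used only for this illustration.
public class MatcherSketch {
    // Collects each successive match of regex in input, separated by commas.
    static String findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer found = new StringBuffer();
        while (m.find()) {               // advance to the next match
            if (found.length() > 0) {
                found.append(',');
            }
            found.append(m.group());     // text of the match just found
        }
        return found.toString();
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        String input = "314 and 318";

        System.out.println(findAll("\\d+", input));            // 314,318
        System.out.println(digits.matcher(input).matches());   // false: whole string must match
        System.out.println(digits.matcher(input).lookingAt()); // true: a prefix matches
        System.out.println(digits.matcher("314").matches());   // true
    }
}
```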
Regex Additions to the String Class
The split() method, also new in J2SE 1.4, is a regex-based addition to the String class. It is similar to the split routine in Perl. It uses an input regex as a delimiter and deconstructs the contents of a string into an array of strings.
This method is useful for parsing character-delimited text files or user input. To parse a colon-delimited text string from a user password file one could use:
- String x = "joe:x:670:500::/home/joe:/bin/false";
- String arr[] = x.split(":");
The contents of arr[] are { "joe", "x", "670", "500", "", "/home/joe", "/bin/false" }.
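A short sketch shows the same split together with two details that are easy to miss: the delimiter is a full regex, and the one-argument split() discards trailing empty strings. The class name SplitSketch and its fields() helper are hypothetical.

```java
// SplitSketch is a hypothetical name used only for this illustration.
public class SplitSketch {
    // Splits a colon-delimited record into its fields.
    static String[] fields(String record) {
        return record.split(":");
    }

    public static void main(String[] args) {
        String[] arr = fields("joe:x:670:500::/home/joe:/bin/false");
        System.out.println(arr.length);          // 7
        System.out.println("[" + arr[4] + "]");  // []  (the empty field is preserved)
        System.out.println(arr[5]);              // /home/joe

        // The delimiter is a regex: a comma with optional surrounding whitespace.
        String[] words = "a, b ,c".split("\\s*,\\s*");
        System.out.println(words.length);        // 3

        // Trailing empty fields are dropped unless a negative limit is given.
        System.out.println("a:b::".split(":").length);     // 2
        System.out.println("a:b::".split(":", -1).length); // 4
    }
}
```

When trailing empty fields are significant, as they can be in delimited records, the two-argument form with a negative limit keeps them.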
A Form Validation Example
Often, websites relegate form validation to the browser and JavaScript. Authors of heavily trafficked sites may choose browser-side form validation to offload server-side processing. Problems can arise when users disable scripting in their browsers, and server cycles ultimately may be traded for developer cycles, as the JavaScript is a separate codebase that must be maintained.
With regular expressions, server-side form validation is easily implemented.
Consider a small web application developed in anticipation of an area code change from 318 to 543. The user enters his name and phone number. A servlet or JSP verifies the proper format of the name and phone number and checks a data source to see if the area code is changing. If so, it displays the phone number with the new area code.
A war file of this application is available for download.
Below is the bean that implements its regular expression logic.
- package com.ociweb.jnb;
-
- import java.util.regex.*;
-
- public class RegexValidate {
- private String name ="";
- private String phone = "";
- private String response =
- "Your name and phone number are not on record.";
-
- // area code 318 will change to 543; 314 will not change
- private static final String OLD318 = "318";
- private static final String NEW318 = "543";
-
- // precompiled patterns
- // notice double escape of special characters
- private Pattern phonePattern =
- Pattern.compile("\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d");
- private Pattern namePattern =
- Pattern.compile("^[A-Z]\\s[A-Z][a-z]*");
- private Pattern newAreaCodePattern =
- Pattern.compile("^31\\d");
-
- private String nameList[]= {
- // "name:phoneNumber"
- "S White:314-555-3141",
- "J Smith:318-555-3147",
- "C Brown:314-555-6543",
- "T Hogan:318-555-3180",
- "E James:636-555-8970"
- };
-
- private int NUMNAMES = nameList.length;
-
- public RegexValidate() { }
-
- public void setName (String name) {
- this.name = name;
- if(!isValidName())
- throw new IllegalArgumentException("Invalid Name.");
- }
-
- public void setPhone (String phone) {
- this.phone = phone;
- if(!isValidPhone())
- throw new IllegalArgumentException("Invalid Phone Number.");
- }
-
- private boolean isValidPhone () {
- Matcher phoneMat = phonePattern.matcher(this.phone);
- return phoneMat.matches();
- }
-
- private boolean isValidName () {
- Matcher nameMat = namePattern.matcher(this.name);
- return nameMat.matches();
- }
-
- public String getResponse () {
- boolean found = false;
- String nameData[] = new String[2];
-
- for (int ctr = 0; ctr < NUMNAMES && !found; ctr++) {
- nameData = this.nameList[ctr].split(":");
-
- if (nameData[0].equals(this.name) &&
- nameData[1].equals(this.phone)) {
-
- found = true;
- this.response = "Your phone number, " +
- this.phone + ", will not change.";
-
- // check old and new area code
- Matcher mat = newAreaCodePattern.matcher(this.phone);
-
- if(mat.find())
- if(mat.group().equals(OLD318)) {
- this.response = "Your new phone number is " +
- mat.replaceAll(NEW318);
- }
- }
- }
-
- return this.response;
- }
- }
The isValidPhone() and isValidName() methods use precompiled patterns to limit allowed input for the phone number and name fields. The first constrains phone numbers to a 10-digit hyphenated format. The second ensures that names are properly capitalized and entered as a first initial, a space, and a last name.
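The behavior of the two validation patterns can be checked in isolation. The sketch below is a standalone approximation rather than the bean itself: the class name ValidationSketch is hypothetical, the phone pattern is written with the equivalent counted form \d{3}, and the leading ^ is omitted from the name pattern because matches() already requires the whole input to match.

```java
import java.util.regex.Pattern;

// ValidationSketch is a hypothetical name used only for this illustration.
public class ValidationSketch {
    // Equivalent to \d\d\d-\d\d\d-\d\d\d\d in the bean.
    private static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");
    // One capital initial, one whitespace character, then a capitalized last name.
    private static final Pattern NAME = Pattern.compile("[A-Z]\\s[A-Z][a-z]*");

    static boolean isValidPhone(String phone) {
        return PHONE.matcher(phone).matches();
    }

    static boolean isValidName(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPhone("318-555-3147")); // true
        System.out.println(isValidPhone("3185553147"));   // false: hyphens required
        System.out.println(isValidName("J Smith"));       // true
        System.out.println(isValidName("j smith"));       // false: not capitalized
        System.out.println(isValidName("John Smith"));    // false: initial only
        System.out.println(isValidName("J McCoy"));       // false: inner capital rejected
    }
}
```

The last case shows a limitation of the name pattern: it accepts only a single capital letter followed by lowercase letters, so names with internal capitals, hyphens or apostrophes are rejected.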
The getResponse() method uses the split() method to deconstruct a list of names and phone numbers. The split phone numbers are matched against a pattern that selects the area code. While the match may be performed against each record in the data source, all the matches are against the same pattern, so the code is more efficient if the pattern is compiled outside of the getResponse() method and reused.
The pattern '^31\\d' matches both '314' and '318'. Since 318 is the only area code that changes, the example could use a pattern that matched only 318. The partial match pattern was chosen to show the versatility of the regex implementation and to demonstrate the group() method.
The example uses the group() method of the Matcher class to return the substring of the last match. It then checks the substring and calls replaceAll() if the substring is '318'. The replaceAll() method changes the area code to '543'. Since the pattern only matches digits at the start of the line, replaceAll() does not replace any '318' substrings that occur later in the string.
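The area code logic can be isolated in a few lines. This is a minimal sketch of the same find(), group() and replaceAll() sequence; the class name AreaCodeSketch and the updateAreaCode() method are hypothetical and do not appear in the bean.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// AreaCodeSketch is a hypothetical name used only for this illustration.
public class AreaCodeSketch {
    // Anchored at the start of the input, so only the area code can match.
    private static final Pattern AREA = Pattern.compile("^31\\d");

    // Returns the phone number with area code 318 changed to 543,
    // or the number unchanged for any other area code.
    static String updateAreaCode(String phone) {
        Matcher m = AREA.matcher(phone);
        if (m.find() && m.group().equals("318")) {
            return m.replaceAll("543");
        }
        return phone;
    }

    public static void main(String[] args) {
        System.out.println(updateAreaCode("318-555-3180")); // 543-555-3180
        System.out.println(updateAreaCode("314-555-3141")); // 314-555-3141
        System.out.println(updateAreaCode("636-555-8970")); // 636-555-8970
    }
}
```

In the first call the trailing "3180" contains '318', but the ^ anchor prevents it from matching, so only the leading area code is replaced.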
Summary
The new java.util.regex package provides regular expression functionality that has been absent from the standard Java API.
The methods of the new Matcher and Pattern classes and the grammar they support let developers describe and manipulate sequences of characters succinctly. They can replace string machinations with simpler regex constructs that are more powerful and easier to use and maintain.
References
Other Regex Implementations
For those who don't have access to the J2SE 1.4 or are tied to earlier versions of the JDK, there are several third-party regular expression packages. Among them are ORO and Regexp. Both are part of the Jakarta project.
ORO boasts more features and seems to have more active development. The Free Software Foundation distributes the gnu.regexp package, and Pat is a regex package compatible with JDK 1.0.
Other Sources of Information
- [1] J2SE 1.4.0 Beta 2
http://java.sun.com/j2se/1.4
- [2] Javasoft
http://www.javasoft.com
- [3] Package gnu.regexp
http://www.cacas.org/~wes/java
- [4] Package java.util.regex
http://java.sun.com/j2se/1.4/docs/api/java/util/regex/package-summary.html
- [5] ORO
http://jakarta.apache.org/oro
- [6] PAT
http://www.javaregex.com
- [7] Regexp
http://jakarta.apache.org/regexp
Software Engineering Tech Trends (SETT) is a regular publication featuring emerging trends in software engineering.