2/1/12

Regular Expressions in Java

A regular expression is a kind of pattern that can be applied to text (String, in Java). Java provides the java.util.regex package for pattern matching with regular expressions. Java regular expressions are very similar to those of the Perl programming language and are easy to learn.

A regular expression either matches the text (or a part of it) or fails to match.
* If a regular expression matches a part of the text, we can find out which part.
** If the regular expression is complex, we can also find out which part of the regular expression matched which part of the text.
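
To illustrate the last point, here is a minimal sketch using capturing groups and the Pattern and Matcher classes described later in this post. The class name GroupDemo, the method describe and the pattern are made up for this illustration; m.group(n) returns the text matched by the n-th parenthesized part of the expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns "word=<group 1>, number=<group 2>" for the first match, or null if there is none.
    static String describe(String text) {
        // Group 1 captures a run of lower case letters, group 2 a run of digits.
        Pattern p = Pattern.compile("([a-z]+) (\\d+)");
        Matcher m = p.matcher(text);
        if (!m.find()) {
            return null;
        }
        return "word=" + m.group(1) + ", number=" + m.group(2);
    }

    public static void main(String[] args) {
        // prints: word=code, number=2
        System.out.println(describe("code 2 learn java tutorial"));
    }
}
```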

A First Example


The regular expression "[a-z]+" matches any sequence of one or more lower case letters in the text.
        [a-z] means any character from a to z, inclusive, and + means "one or more".

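Suppose we supply the string "code 2 learn java tutorial". The pattern then matches "code", "learn", "java" and "tutorial", but not the digit or the spaces.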


How to do it in Java


First, you must compile the pattern:
      import java.util.regex.*;
      Pattern p = Pattern.compile("[a-z]+");

Next, you must create a matcher for the text by calling the matcher method on the pattern:
      Matcher m = p.matcher("code 2 learn java tutorial");


NOTE:

Neither Pattern nor Matcher has a public constructor; we create both by using methods of the Pattern class.


Pattern Class: A Pattern object is a compiled representation of a regular expression. The Pattern class provides no public constructors. To create a pattern, you must first invoke one of its public static compile methods, which will then return a Pattern object. These methods accept a regular expression as the first argument.

Matcher Class: A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher defines no public constructors. You obtain a Matcher object by invoking the matcher method on a Pattern object.
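
As a rough sketch of the two paragraphs above, the following uses two of the static compile methods: Pattern.compile(regex) and Pattern.compile(regex, flags). The class name CompileDemo, the helper count and the sample text are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompileDemo {
    // Counts the matches of regex in text, optionally ignoring case.
    static int count(String regex, boolean ignoreCase, String text) {
        // Patterns come from the static compile methods, never from "new Pattern(...)".
        Pattern p = ignoreCase
                ? Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                : Pattern.compile(regex);
        // Matchers come from the pattern's matcher method.
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "Java and JAVA and java";
        System.out.println(count("java", false, text)); // prints: 1
        System.out.println(count("java", true, text));  // prints: 3
    }
}
```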

After the steps above, we have a matcher m and can check whether a match was found and, if so, at which index positions it starts and ends.

m.matches() returns true if the pattern matches the entire string, and false otherwise.
m.lookingAt() returns true if the pattern matches at the beginning of the string, and false otherwise.
m.find() returns true if the pattern matches any part of the text, and false otherwise. Each further call to m.find() continues searching from where the previous match ended.
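
The difference between the three methods is easiest to see side by side. The sketch below is only an illustration: the class name MatchKindsDemo and the helper check are made up, and a fresh matcher is created for each call so that one method's state does not affect the next.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class MatchKindsDemo {
    // Returns { matches(), lookingAt(), find() } for the given regex and text.
    static boolean[] check(String regex, String text) {
        Pattern p = Pattern.compile(regex);
        return new boolean[] {
            p.matcher(text).matches(),    // whole text must match
            p.matcher(text).lookingAt(),  // match must start at index 0
            p.matcher(text).find()        // match may be anywhere
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(check("[a-z]+", "code")));         // [true, true, true]
        System.out.println(Arrays.toString(check("[a-z]+", "code 2 learn"))); // [false, true, true]
        System.out.println(Arrays.toString(check("[a-z]+", "2 learn")));      // [false, false, true]
    }
}
```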


Finding what was matched


After a successful match, m.start() will return the index of the first character matched and m.end() will return the index of the last character matched, plus one.

If no match was attempted, or if the match was unsuccessful, m.start() and m.end() will throw an IllegalStateException.
         – This is a RuntimeException, so you don’t have to catch it.

It may seem strange that m.end() returns the index of the last character matched plus one, but this is just what most String methods require.
              – For example, if m was created for the text "Now is the time", then "Now is the time".substring(m.start(), m.end()) will return exactly the matched substring.
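
A small sketch of both behaviours, using the "Now is the time" text from above. The class name StartEndDemo and its two helper methods are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartEndDemo {
    // Returns "start-end:substring" for the first match, or "no match".
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return "no match";
        }
        // end() is one past the last matched character, which is what substring expects.
        return m.start() + "-" + m.end() + ":" + text.substring(m.start(), m.end());
    }

    // Returns true if start() throws IllegalStateException when no match has been attempted.
    static boolean startBeforeMatchThrows(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        try {
            m.start();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("[a-z]+", "Now is the time"));             // prints: 1-3:ow
        System.out.println(startBeforeMatchThrows("[a-z]+", "Now is the time")); // prints: true
    }
}
```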


Java Program:



import java.util.regex.*;

public class RegexTest {
   public static void main(String[] args) {
      String pattern = "[a-z]+";
      String text = "code 2 learn java tutorial";
      Pattern p = Pattern.compile(pattern);
      Matcher m = p.matcher(text);
      // Each call to find() moves on to the next match in the text
      while (m.find()) {
          // Print the matched substring followed by a '*'
          System.out.print(text.substring(m.start(), m.end()) + "*");
      }
   }
}

Output: code*learn*java*tutorial*


Additional Methods


If m is a matcher, then

m.replaceFirst(replacement) returns a new String in which the first substring matched by the pattern has been replaced by replacement.
m.replaceAll(replacement) returns a new String in which every substring matched by the pattern has been replaced by replacement.
m.find(startIndex) resets this matcher and then looks for the next pattern match, starting at the specified index.
m.reset() resets this matcher, so that the next search starts at the beginning of the text.
m.reset(newText) resets this matcher and gives it new text to examine (any CharSequence, such as a String, StringBuffer, or CharBuffer).
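
The methods above can be tried out with a sketch like the following, which reuses the "[a-z]+" pattern from the earlier example. The class name MoreMethodsDemo and its helper methods are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MoreMethodsDemo {
    static final Pattern WORD = Pattern.compile("[a-z]+");

    static String replaceFirstWord(String text, String replacement) {
        return WORD.matcher(text).replaceFirst(replacement);
    }

    static String replaceAllWords(String text, String replacement) {
        return WORD.matcher(text).replaceAll(replacement);
    }

    // Returns the first match found at or after startIndex, or null if there is none.
    static String findFrom(String text, int startIndex) {
        Matcher m = WORD.matcher(text);
        return m.find(startIndex) ? text.substring(m.start(), m.end()) : null;
    }

    // Reuses one matcher on a second text and returns the first match in that text.
    static String firstWordAfterReset(String oldText, String newText) {
        Matcher m = WORD.matcher(oldText);
        m.find();           // move into the old text
        m.reset(newText);   // same matcher, new input, search position back at 0
        return m.find() ? newText.substring(m.start(), m.end()) : null;
    }

    public static void main(String[] args) {
        String text = "code 2 learn java tutorial";
        System.out.println(replaceFirstWord(text, "X"));               // X 2 learn java tutorial
        System.out.println(replaceAllWords(text, "X"));                // X 2 X X X
        System.out.println(findFrom(text, 12));                        // java
        System.out.println(firstWordAfterReset("code 2", "2 learn"));  // learn
    }
}
```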




Regular Expression Syntax


Here is a table of the regular expression metacharacters available in Java:

Subexpression   Matches
^               Beginning of the input. With the MULTILINE flag, (?m), also the beginning of each line.
$               End of the input. With the MULTILINE flag, (?m), also the end of each line.
.               Any single character except a line terminator. With the DOTALL flag, (?s), it matches line terminators as well.
[...]           Any single character in brackets.
[^...]          Any single character not in brackets.
\A              Beginning of the entire input.
\z              End of the entire input.
\Z              End of the entire input, except for an allowable final line terminator.
re*             0 or more occurrences of the preceding expression.
re+             1 or more occurrences of the preceding expression.
re?             0 or 1 occurrence of the preceding expression.
re{n}           Exactly n occurrences of the preceding expression.
re{n,}          n or more occurrences of the preceding expression.
re{n,m}         At least n and at most m occurrences of the preceding expression.
a|b             Either a or b.
(re)            Groups regular expressions and remembers the matched text.
(?:re)          Groups regular expressions without remembering the matched text.
(?>re)          Matches an independent pattern without backtracking.
\w              Word characters. Equivalent to [a-zA-Z_0-9].
\W              Nonword characters.
\s              Whitespace. Equivalent to [ \t\n\x0B\f\r].
\S              Nonwhitespace.
\d              Digits. Equivalent to [0-9].
\D              Nondigits.
\G              The point where the last match finished.
\n              Back-reference to capture group number n, where n is a digit (for example \1).
\b              A word boundary. Unlike in Perl, \b is not accepted inside brackets; use \x08 to match a backspace there.
\B              A nonword boundary.
\n, \t, etc.    Newlines, tabs, etc. (here \n is the letter n, not a group number).
\Q              Escapes (quotes) all characters up to \E.
\E              Ends quoting begun with \Q.

In a Java string literal each backslash must itself be escaped, so \d is written "\\d".
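
A few of the table entries in action, as a sketch only. The class name SyntaxDemo, the method names and the sample patterns are made up for this illustration; note the doubled backslashes required inside Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SyntaxDemo {
    // \b(\w+) \1\b : a word, a space, then the same word again (back-reference to group 1).
    static String firstRepeatedWord(String text) {
        Matcher m = Pattern.compile("\\b(\\w+) \\1\\b").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // \d{4}-\d{2}-\d{2} : exactly four digits, two digits, two digits, separated by hyphens.
    static boolean looksLikeIsoDate(String text) {
        return Pattern.compile("\\d{4}-\\d{2}-\\d{2}").matcher(text).matches();
    }

    // \Q1+1\E : the + is quoted, so it is a literal plus sign rather than "one or more".
    static boolean containsOnePlusOne(String text) {
        return Pattern.compile("\\Q1+1\\E").matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test")); // prints: is
        System.out.println(looksLikeIsoDate("2012-02-01"));         // prints: true
        System.out.println(containsOnePlusOne("1+1=2"));            // prints: true
        System.out.println(containsOnePlusOne("11"));               // prints: false
    }
}
```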
