Sunday, 5 June 2016

Regular Expression Tutorial


* A regular expression defines a search pattern for strings.

* Regular expressions can be used to search, edit and manipulate text.

* The pattern defined by the regular expression may match one or several times or not at all for a given string.

* The abbreviation for regular expression is regex.

* Analysing or modifying a text with a regex is described as applying the regular expression to the text (string).

* The pattern defined by the regex is applied on the text from left to right. Once a source character has been used in a match, it cannot be reused. For example, the regex aba will match ababababa only two times (aba_aba__).

* A simple example of a regular expression is a (literal) string. For example, the Hello World regex will match the "Hello World" string.

* . (dot) is another example of a regular expression. A dot matches any single character; it would match, for example, "a" or "z" or "1".
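
The points above about left-to-right matching and non-reusable characters can be checked with a few lines of Java. This is only a minimal sketch; the class name MatchCounter and the helper countMatches are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCounter {
    // Counts how often the regex matches while scanning the text left to right.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) { // each find() resumes after the end of the previous match
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("aba", "ababababa"));      // prints 2
        System.out.println(countMatches(".", "a1z"));              // prints 3, one match per character
        System.out.println("Hello World".matches("Hello World"));  // prints true
    }
}
```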

Support for regular expressions in programming languages

     Regular expressions are supported by most programming languages, e.g., Java, Perl, Groovy, etc. Unfortunately, each language supports regular expressions slightly differently.
   
     The following description is an overview of available meta characters which can be used in regular expressions. This chapter is supposed to be a reference for the different regex elements.

1. Common matching symbols



2. Meta characters

     The following meta characters have a pre-defined meaning and make certain common patterns easier to use, e.g., \d instead of [0-9].


3. Quantifier

     A quantifier defines how often an element can occur. The symbols ?, *, + and {} specify how many times the preceding element may repeat.
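
As a rough illustration of the four quantifiers, the sketch below checks whole-string matches with Pattern.matches. The class name QuantifierDemo, the helper fullMatch and the sample patterns are assumptions chosen for this example.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True only if the regex matches the entire input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("colou?r", "color"));   // true:  ? lets "u" be absent
        System.out.println(fullMatch("ab*c", "ac"));         // true:  * allows zero "b"
        System.out.println(fullMatch("ab+c", "ac"));         // false: + needs at least one "b"
        System.out.println(fullMatch("\\d{2,4}", "12345"));  // false: {2,4} allows at most four digits
    }
}
```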


Subexpression   Matches
^               Matches beginning of line.
$               Matches end of line.
.               Matches any single character except newline. Using the DOTALL option ((?s) in Java) allows it to match newline as well.
[...]           Matches any single character in brackets.
[^...]          Matches any single character not in brackets.
\A              Matches beginning of entire string.
\z              Matches end of entire string.
\Z              Matches end of entire string except allowable final line terminator.
re*             Matches 0 or more occurrences of preceding expression.
re+             Matches 1 or more occurrences of preceding expression.
re?             Matches 0 or 1 occurrence of preceding expression.
re{n}           Matches exactly n occurrences of preceding expression.
re{n,}          Matches n or more occurrences of preceding expression.
re{n,m}         Matches at least n and at most m occurrences of preceding expression.
a|b             Matches either a or b.
(re)            Groups regular expressions and remembers matched text.
(?:re)          Groups regular expressions without remembering matched text.
(?>re)          Matches independent pattern without backtracking.
\w              Matches word characters.
\W              Matches nonword characters.
\s              Matches whitespace. Equivalent to [ \t\n\x0B\f\r] in Java.
\S              Matches nonwhitespace.
\d              Matches digits. Equivalent to [0-9].
\D              Matches nondigits.
\G              Matches point where last match finished.
\1, \2, ...     Back-reference to the capture group with that number.
\b              Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
\B              Matches non word boundaries.
\n, \t, etc.    Matches newlines, carriage returns, tabs, etc.
\Q              Escapes (quotes) all characters up to \E.
\E              Ends quoting begun with \Q.
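
To show a few of these elements working together, here is a small sketch that combines \b, \w, \s, a capturing group and a back-reference to find a doubled word. The class name RepeatedWordFinder and its method are hypothetical and not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedWordFinder {
    // \b(\w+)  a whole word, remembered as group 1
    // \s+      one or more whitespace characters
    // \1\b     the same text as group 1 again, ending at a word boundary
    private static final Pattern REPEATED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns the first word that is immediately repeated, or null if there is none.
    static String firstRepeatedWord(String text) {
        Matcher matcher = REPEATED.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test"));  // prints is
        System.out.println(firstRepeatedWord("no repeats here"));    // prints null
        // \Q...\E quotes the "+" so it is matched literally
        System.out.println(Pattern.matches("\\Q1+1=2\\E", "1+1=2")); // prints true
    }
}
```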

Examples
1. Matching a Username
^[a-z0-9_\.]{3,15}$
Description
^               # Start of the line
[a-z0-9_\.]     # Match characters and symbols in the list, a-z, 0-9, underscore, dot
{3,15}          # Length at least 3 characters and maximum length of 15
$               # End of the line
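     Taken together, the pattern accepts 3 to 15 characters, each of which is a lowercase letter, a digit, an underscore “_” or a dot “.”. This is a common username pattern that is widely used on different websites. Note that the example below uses the stricter bounds {6,14}, which is why 'ashok' (5 characters) is rejected.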
Example
package com.ashok.regularexpressions;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UsernameValidator {
    private static Pattern usrNamePtrn = Pattern.compile("^[a-z0-9_\\.]{6,14}$");

    public static boolean validateUserName(String userName) {
        Matcher mtch = usrNamePtrn.matcher(userName);
        if (mtch.matches()) {
           return true;
        }
        return false;
    }

    public static void main(String args[]) {
        System.out.println("Is 'mvashok21' a valid user name? " + validateUserName("mvashok21"));
        System.out.println("Is 'ashok' a valid user name? " + validateUserName("ashok"));
        System.out.println("Is 'MVASHOK' a valid user name? " + validateUserName("MVASHOK"));
        System.out.println("Is 'mv.2.ashok' a valid user name? " + validateUserName("mv.2.ashok"));
        System.out.println("Is 'mv_2-ashok' a valid user name? " + validateUserName("mv_2-ashok"));
    }
}

Output
Is 'mvashok21' a valid user name? true
Is 'ashok' a valid user name? false
Is 'MVASHOK' a valid user name? false
Is 'mv.2.ashok' a valid user name? true
Is 'mv_2-ashok' a valid user name? false
2. Matching a Password 
((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%]).{6,20})
Description 
(               # Start of group
(?=.*\d)        # must contain at least one digit from 0-9
(?=.*[a-z])     # must contain at least one lowercase character
(?=.*[A-Z])     # must contain at least one uppercase character
(?=.*[@#$%])    # must contain at least one special symbol from the list "@#$%"
.               # match any character, with the previous conditions checked
{6,20}          # length at least 6 characters and maximum of 20
)               # End of group
     ?= introduces a lookahead assertion. It checks that a condition holds from the current position without consuming any characters, so it is only useful in combination with a pattern that does the actual matching.
     Taken together, the pattern accepts a string of 6 to 20 characters with at least one digit, one uppercase letter, one lowercase letter and one special symbol (“@#$%”). This regular expression is useful for enforcing a strong and complex password.
Example
package com.ashok.regularexpressions;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PasswordValidator {
    // Length bounds {6,20} as in the pattern described above
    private static Pattern password = Pattern.compile("((?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%]).{6,20})");

    public static boolean validatePassword(String candidate) {
        // matches() succeeds only if the whole candidate satisfies the pattern
        Matcher mtch = password.matcher(candidate);
        return mtch.matches();
    }

    public static void main(String args[]) {
        System.out.println("Is 'mvashok21' a valid password? " + validatePassword("mvashok21"));
        System.out.println("Is 'mvashok' a valid password? " + validatePassword("mvashok"));
        System.out.println("Is 'mvAshok21$' a valid password? " + validatePassword("mvAshok21$"));
        System.out.println("Is '234aBc#' a valid password? " + validatePassword("234aBc#"));
    }
}

Output
Is 'mvashok21' a valid password? false
Is 'mvashok' a valid password? false
Is 'mvAshok21$' a valid password? true
Is '234aBc#' a valid password? true
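
A single regex only answers yes or no. If you want to tell the user which rule failed, one option is to test each condition separately, which is roughly what the four lookaheads do inside the combined pattern. The sketch below is one possible way to do that; the class PasswordRuleChecker, its method and the messages are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class PasswordRuleChecker {
    // One small pattern per rule, mirroring the lookaheads of the combined regex.
    private static final Pattern DIGIT = Pattern.compile("\\d");
    private static final Pattern LOWER = Pattern.compile("[a-z]");
    private static final Pattern UPPER = Pattern.compile("[A-Z]");
    private static final Pattern SYMBOL = Pattern.compile("[@#$%]");

    // Returns a message for every rule the candidate breaks; an empty list means it passes.
    static List<String> failedRules(String candidate) {
        List<String> failures = new ArrayList<>();
        if (candidate.length() < 6 || candidate.length() > 20) {
            failures.add("length must be 6 to 20 characters");
        }
        // find() looks for the pattern anywhere in the candidate, like (?=.*X)
        if (!DIGIT.matcher(candidate).find()) failures.add("needs a digit");
        if (!LOWER.matcher(candidate).find()) failures.add("needs a lowercase letter");
        if (!UPPER.matcher(candidate).find()) failures.add("needs an uppercase letter");
        if (!SYMBOL.matcher(candidate).find()) failures.add("needs one of @#$%");
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(failedRules("mvAshok21$")); // prints []
        System.out.println(failedRules("mvashok21"));  // two failures: uppercase letter and symbol
    }
}
```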
3. Validate Hex color code
^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$
Description
^               # start of the line
#               # must contain a "#" symbol
(               # start of group #1
[A-Fa-f0-9]{6}  # any characters in the list, with length of 6
|               # or
[A-Fa-f0-9]{3}  # any characters in the list, with length of 3
)               # end of group #1
$               # end of the line
     Taken together, the string must start with a “#” symbol, followed by exactly 6 or exactly 3 characters, each of which is a letter from “a” to “f”, a letter from “A” to “F” or a digit from “0” to “9”. This regular expression is useful for checking hexadecimal web color codes.
Example
package com.ashok.regularexpressions;
 
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class HexValidator {
    private static Pattern hexPtrn = Pattern.compile("^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$");

    public static boolean validateHexValue(String hex) {
        Matcher mtch = hexPtrn.matcher(hex);
        if (mtch.matches()) {
           return true;
        }
        return false;
    }
    
    public static void main(String a[]) {
        System.out.println("Is '#ff0000' a valid hex color code? " + validateHexValue("#ff0000"));
        System.out.println("Is 'ashok' a valid hex color code? " + validateHexValue("ashok"));
        System.out.println("Is '#800000' a valid hex color code? " + validateHexValue("#800000"));
        System.out.println("Is '#ffbf00' a valid hex color code? " + validateHexValue("#ffbf00"));
        System.out.println("Is 'e60000' a valid hex color code? " + validateHexValue("e60000"));
    }
}

Output
Is '#ff0000' a valid hex color code? true
Is 'ashok' a valid hex color code? false
Is '#800000' a valid hex color code? true
Is '#ffbf00' a valid hex color code? true
Is 'e60000' a valid hex color code? false
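
Because the hex digits sit inside group #1, the same pattern can also be used to read the color value once it has been validated. The following sketch assumes the usual web convention that a 3-digit code such as #f80 is shorthand for #ff8800; the class HexColorParser and the method toRgb are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HexColorParser {
    private static final Pattern HEX = Pattern.compile("^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$");

    // Returns {red, green, blue}, or null when the input is not a valid hex color code.
    static int[] toRgb(String hex) {
        Matcher matcher = HEX.matcher(hex);
        if (!matcher.matches()) {
            return null;
        }
        String digits = matcher.group(1); // the 3 or 6 hex digits without "#"
        if (digits.length() == 3) {
            // Assumed shorthand rule: double every digit, "f80" becomes "ff8800"
            StringBuilder expanded = new StringBuilder();
            for (char c : digits.toCharArray()) {
                expanded.append(c).append(c);
            }
            digits = expanded.toString();
        }
        return new int[] {
            Integer.parseInt(digits.substring(0, 2), 16),
            Integer.parseInt(digits.substring(2, 4), 16),
            Integer.parseInt(digits.substring(4, 6), 16)
        };
    }

    public static void main(String[] args) {
        int[] rgb = toRgb("#ffbf00");
        System.out.println(rgb[0] + ", " + rgb[1] + ", " + rgb[2]); // prints 255, 191, 0
    }
}
```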
4. Email Regular Expression Pattern
^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*
@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$
Description
^                     # start of the line
[_A-Za-z0-9-\\+]+     # must start with one or more (+) characters from the bracket [ ]
(                     # start of group #1
\\.[_A-Za-z0-9-]+     # a dot "." followed by one or more (+) characters from the bracket [ ]
)*                    # end of group #1, this group is optional (*)
@                     # must contain a "@" symbol
[A-Za-z0-9-]+         # followed by one or more (+) characters from the bracket [ ]
(                     # start of group #2 - first level TLD checking
\\.[A-Za-z0-9]+       # a dot "." followed by one or more (+) characters from the bracket [ ]
)*                    # end of group #2, this group is optional (*)
(                     # start of group #3 - second level TLD checking
\\.[A-Za-z]{2,}       # a dot "." followed by characters from the bracket [ ], with minimum length of 2
)                     # end of group #3
$                     # end of the line
     Taken together, the email address must start with characters from “_A-Za-z0-9-\\+”, optionally followed by one or more parts of the form “\\.[_A-Za-z0-9-]+”, and then a “@” symbol. The domain name must start with characters from “A-Za-z0-9-”, optionally followed by further dot-separated parts “\\.[A-Za-z0-9]+” (such as the .com in .com.au), and must end with a final part “\\.[A-Za-z]{2,}” (such as .com, .net or .au), which starts with a dot “.” and has at least 2 letters. The backslashes are doubled because the pattern is shown as it appears in a Java string literal.
Example
package com.ashok.regularexpressions;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailValidator {
     // Same pattern as described above
     private static Pattern email = Pattern.compile("^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$");

     public static boolean validateEmailAddress(String userName) {
         Matcher mtch = email.matcher(userName);
         if (mtch.matches()) {
            return true;
         }
         return false;
     }

     public static void main(String args[]) {
        System.out.println("Is 'ashokkumar.mariyala@gmail.com' a valid email address? " + validateEmailAddress("ashokkumar.mariyala@gmail.com"));
        System.out.println("Is 'ashok*7*&@yahoo.com' a valid email address? " + validateEmailAddress("ashok*7*&@yahoo.com"));
        System.out.println("Is 'mvashok@gmail.com' a valid email address? " + validateEmailAddress("mvashok@gmail.com"));
        System.out.println("Is 'MVASHOK.gmail.com' a valid email address? " + validateEmailAddress("MVASHOK.gmail.com"));
     }
}

Output
Is 'ashokkumar.mariyala@gmail.com' a valid email address? true
Is 'ashok*7*&@yahoo.com' a valid email address? false
Is 'mvashok@gmail.com' a valid email address? true
Is 'MVASHOK.gmail.com' a valid email address? false
5. Image File Extension
([^\s]+(\.(?i)(jpg|png|gif|bmp))$)
Description
(         # start of the group #1
[^\s]+    # must contain one or more characters (anything except white space)
(         # start of the group #2
\.        # followed by a dot "."
(?i)      # ignore case for the following characters
(         # start of the group #3
jpg       # contains characters "jpg"
|         # or
png       # contains characters "png"
|         # or
gif       # contains characters "gif"
|         # or
bmp       # contains characters "bmp"
)         # end of the group #3
)         # end of the group #2
$         # end of the string
)         # end of the group #1
     Taken together, the pattern accepts one or more non-whitespace characters, followed by a dot “.” and ending in “jpg”, “png”, “gif” or “bmp”; the file extension is case-insensitive.
     This regular expression is widely used for file extension checks. You can change the final alternation (jpg|png|gif|bmp) to check for other file extensions; the example below, for instance, adds “img” to the list.
Example
package com.ashok.regularexpressions;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageValidator {
    private static Pattern image = Pattern.compile("([^\\s]+(\\.(?i)(jpg|png|gif|bmp|img))$)");

    public static boolean validateImage(String userName) {
        Matcher mtch = image.matcher(userName);
        if (mtch.matches()) {
           return true;
        }
        return false;
    }

    public static void main(String a[]) {
        System.out.println("Is 'ashok.img' a valid image? " + validateImage("ashok.img"));
        System.out.println("Is 'ashok.doc' a valid image? " + validateImage("ashok.doc"));
        System.out.println("Is 'ashok@gif' a valid image? " + validateImage("ashok@gif"));
    }
}

Output
Is 'ashok.img' a valid image? true
Is 'ashok.doc' a valid image? false
Is 'ashok@gif' a valid image? false
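
The inline (?i) is not the only way to ignore case. In Java the same effect can be requested with the Pattern.CASE_INSENSITIVE compile flag, and the capturing group around the alternation gives you the extension itself. This is a sketch with a hypothetical class ImageExtension; it uses the four extensions from the description and leaves out “img”.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageExtension {
    // Case-insensitivity comes from the compile flag instead of the inline (?i).
    private static final Pattern IMAGE =
            Pattern.compile("[^\\s]+\\.(jpg|png|gif|bmp)$", Pattern.CASE_INSENSITIVE);

    // Returns the lower-cased extension, or null if the name is not an accepted image file.
    static String extensionOf(String fileName) {
        Matcher matcher = IMAGE.matcher(fileName);
        return matcher.matches() ? matcher.group(1).toLowerCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("photo.JPG"));    // prints jpg
        System.out.println(extensionOf("ashok.doc"));    // prints null
        System.out.println(extensionOf("my photo.png")); // prints null, white space is not allowed
    }
}
```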
6. IP Address Regular Expression Pattern
^([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.
([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])$
Description
^             # start of the line
(             # start of group #1
[01]?\\d\\d?  # one or two digits, or three digits starting with 0 or 1 (e.g. [0-9], [0-9][0-9], [0-1][0-9][0-9])
|             # or
2[0-4]\\d     # start with 2, followed by 0-4 and ending with any digit (2[0-4][0-9])
|             # or
25[0-5]       # start with 25 and ending with 0-5 (25[0-5])
)             # end of group #1
\\.           # followed by a dot "."
....          # the group and the dot are repeated for the second and third numbers, then the group once more without a dot
$             # end of the line
     Taken together, the pattern matches four numbers from 0 to 255 separated by dots “.”, with no dot after the last number. The valid IP address format is “0-255.0-255.0-255.0-255”.
Example
package com.ashok.regularexpressions;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IpValidator {
    public static boolean isValidIP(String ipAddr) {
        Pattern ptn = Pattern.compile("^([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\."
                + "([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\."
                + "([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\."
                + "([01]?\\d\\d?|2[0-4]\\d|25[0-5])$");
        Matcher mtch = ptn.matcher(ipAddr);
        return mtch.find();
    }
    public static void main(String args[]) {
        System.out.println("10.23.45.12 is valid? " + IpValidator.isValidIP("10.23.45.12"));
        System.out.println("10.2a.56.32 is valid? " + IpValidator.isValidIP("10.2a.56.32"));
        System.out.println("10.23.45 is valid? "+ IpValidator.isValidIP("10.23.45"));
        System.out.println("100.230.45.589 is valid? "+ IpValidator.isValidIP("100.230.45.589"));
    }
}

Output
10.23.45.12 is valid? true
10.2a.56.32 is valid? false
10.23.45 is valid? false
100.230.45.589 is valid? false
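
The long pattern repeats the same 0-255 group four times. One way to shorten it is to keep the group in a constant and apply the {3} quantifier to "group plus dot". The sketch below is an alternative formulation, not the pattern used above; CompactIpValidator and OCTET are names made up for this example.

```java
import java.util.regex.Pattern;

class CompactIpValidator {
    // One number from 0 to 255, the same alternatives as in the description above.
    private static final String OCTET = "(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)";

    // Three times "number followed by a dot", then one final number.
    private static final Pattern IPV4 = Pattern.compile("^(" + OCTET + "\\.){3}" + OCTET + "$");

    static boolean isValidIp(String candidate) {
        return IPV4.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidIp("10.23.45.12"));    // prints true
        System.out.println(isValidIp("10.23.45"));       // prints false
        System.out.println(isValidIp("100.230.45.589")); // prints false
    }
}
```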
7. Time in 12-Hour Format
(1[012]|0[1-9]|00):[0-5][0-9](\\s)?(?i)(am|pm)
Description
(          # start of group #1
1[012]     # start with 10, 11, 12
|          # or
0[1-9]     # start with 01, 02, ... 09
|          # or
00         # start with 00
)          # end of group #1
:          # followed by a colon (:)
[0-5][0-9] # followed by 0..5 and 0..9, which means 00 to 59
(\\s)?     # followed by a white space (optional)
(?i)       # the remaining checks are case-insensitive
(am|pm)    # followed by am or pm
     The 12-hour clock format starts with a two-digit hour from 00 to 12, then a colon (:) followed by the minutes 00-59, and ends with am or pm.
Example
package com.ashok.regularexpressions;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Time12HourValidator {
    public static boolean isValidTime(String time) {
        Pattern ptn = Pattern.compile("(1[012]|0[1-9]|00):[0-5][0-9](\\s)?(?i)(am|pm)");
        Matcher mtch = ptn.matcher(time);
        // matches() requires the whole input to be a time; find() would also accept "110:53 am"
        return mtch.matches();
    }

    public static void main(String a[]) {
        System.out.println("10:53 am is valid? " + Time12HourValidator.isValidTime("10:53 am"));
        System.out.println("13:25 am is valid? " + Time12HourValidator.isValidTime("13:25 am"));
        System.out.println("10:23 fm is valid? "+ Time12HourValidator.isValidTime("10:23 fm"));
        System.out.println("07.53 pm is valid? " + Time12HourValidator.isValidTime("07.53 pm"));
    }
}

Output
10:53 am is valid? true
13:25 am is valid? false
10:23 fm is valid? false
07.53 pm is valid? false
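
Once a time has been validated, the groups can be used to convert it. The sketch below puts the minutes in their own group and returns the number of minutes since midnight. It assumes that both 00 and 12 mean the "zero" hour of each half day (so 12:15 pm is 12:15 and 12:00 am is 00:00); the class TimeParser and its method are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimeParser {
    // Group 1: hour, group 2: minutes, group 3: am or pm (case-insensitive)
    private static final Pattern TIME =
            Pattern.compile("(1[012]|0[1-9]|00):([0-5][0-9])\\s?(?i)(am|pm)");

    // Returns the minutes since midnight, or -1 if the input is not a valid 12-hour time.
    static int toMinutesOfDay(String time) {
        Matcher matcher = TIME.matcher(time);
        if (!matcher.matches()) {
            return -1;
        }
        int hour = Integer.parseInt(matcher.group(1)) % 12; // assumption: 12 and 00 both map to 0
        int minute = Integer.parseInt(matcher.group(2));
        if (matcher.group(3).equalsIgnoreCase("pm")) {
            hour += 12;
        }
        return hour * 60 + minute;
    }

    public static void main(String[] args) {
        System.out.println(toMinutesOfDay("10:53 am")); // prints 653
        System.out.println(toMinutesOfDay("07:53PM"));  // prints 1193
        System.out.println(toMinutesOfDay("13:25 am")); // prints -1
    }
}
```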
8. Date Format (dd/mm/yyyy)
(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)
Description
(             # start of group #1
0?[1-9]       # 01-09 or 1-9
|             # or
[12][0-9]     # 10-19 or 20-29
|             # or
3[01]         # 30, 31
)             # end of group #1
/             # follow by a "/"
(             # start of group #2
0?[1-9]       # 01-09 or 1-9
|             # or
1[012]        # 10,11,12
)             # end of group #2
/             # follow by a "/"
(             # start of group #3
(19|20)\\d\\d # 19[0-9][0-9] or 20[0-9][0-9]
)             # end of group #3
     The above regular expression validates the date format “dd/mm/yyyy”, and you can easily customize it to suit your needs. However, the regex alone cannot check leap years or whether a month has 30 or 31 days; that requires some additional logic after the match.
Example
package com.ashok.regularexpressions;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateValidator {
    private static Pattern dateFrmtPtrn = Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");

    public static boolean validateDateFormat(String userName) {
        Matcher mtch = dateFrmtPtrn.matcher(userName);
        if (mtch.matches()) {
            return true;
        }
        return false;
    }

    public static void main(String a[]) {
        System.out.println("Is '03/04/2012' a valid date format? " + validateDateFormat("03/04/2012"));
        System.out.println("Is '12/23/2012' a valid date format? " + validateDateFormat("12/23/2012"));
        System.out.println("Is '12/12/12' a valid date format? " + validateDateFormat("12/12/12"));
        System.out.println("Is '3/4/2012' a valid date format? " + validateDateFormat("3/4/2012"));
    }
}

Output
Is '03/04/2012' a valid date format? true
Is '12/23/2012' a valid date format? false
Is '12/12/12' a valid date format? false
Is '3/4/2012' a valid date format? true
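
The regex checks only the format, so 31/04/2012 or 29/02/2013 would still pass. One way to add the missing calendar logic is to hand the captured day, month and year to java.time.LocalDate, which rejects impossible dates. This is a sketch of that idea with a hypothetical class CalendarDateValidator; it requires Java 8 or later.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CalendarDateValidator {
    private static final Pattern DATE =
            Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");

    // True if the input has the dd/mm/yyyy format and names a day that exists in the calendar.
    static boolean isRealDate(String candidate) {
        Matcher matcher = DATE.matcher(candidate);
        if (!matcher.matches()) {
            return false; // wrong format
        }
        int day = Integer.parseInt(matcher.group(1));
        int month = Integer.parseInt(matcher.group(2));
        int year = Integer.parseInt(matcher.group(3));
        try {
            LocalDate.of(year, month, day); // throws for 31 April or 29 February in a non-leap year
            return true;
        } catch (DateTimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isRealDate("29/02/2012")); // prints true, 2012 is a leap year
        System.out.println(isRealDate("29/02/2013")); // prints false
        System.out.println(isRealDate("31/04/2012")); // prints false, April has 30 days
    }
}
```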
9. HTML tag
<("[^"]*"|'[^']*'|[^'">])*>
Description
<        # start with opening tag "<"
(        # start of group #1
"[^"]*"  # allow string with double quotes enclosed - "string"
|        # or
'[^']*'  # allow string with single quote enclosed - 'string'
|        # or
[^'">]   # cant contains one single quotes, double quotes and ">"
)        # end of group #1
*        # 0 or more
>        # end with closing tag ">" 
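     An HTML tag starts with an opening “<”, followed by any number of double-quoted strings ("string"), single-quoted strings ('string') or other characters that are not a single quote, a double quote or “>”, and ends with a closing “>”. A single unpaired quote or an unquoted “>” inside the tag is therefore not allowed.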
Example
package com.ashok.regularexpressions;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HTMLTagValidator {
    private static Pattern htmlTag = Pattern.compile("<(\"[^\"]*\"|'[^']*'|[^'\">])*>");

    public static boolean validateHtmlTag(String tag) {
        // matches() succeeds only if the whole input is a single tag
        Matcher mtch = htmlTag.matcher(tag);
        return mtch.matches();
    }

    public static void main(String a[]) {
        System.out.println("Is '<html>' a valid tag? " + validateHtmlTag("<html>"));
        System.out.println("Is '<input>' a valid tag? " + validateHtmlTag("<input>"));
        System.out.println("Is '<input value=' id='test'>' a valid tag? " + validateHtmlTag("<input value=' id='test'>"));
        System.out.println("Is '<input value=> >' a valid tag? " + validateHtmlTag("<input value=> >"));
    }
}

Output
Is '<html>' a valid tag? true
Is '<input>' a valid tag? true
Is '<input value=' id='test'>' a valid tag? false
Is '<input value=> >' a valid tag? false
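
The same pattern can also be used with find() to pull every tag out of a longer piece of text. Note how a “>” inside a quoted attribute value does not end the tag early. The class TagExtractor and its method are made-up names, and this is only a sketch, not a substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagExtractor {
    private static final Pattern TAG = Pattern.compile("<(\"[^\"]*\"|'[^']*'|[^'\">])*>");

    // Collects every substring that looks like a tag, in order of appearance.
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group()); // group() without an argument is the whole match
        }
        return tags;
    }

    public static void main(String[] args) {
        // The ">" inside the quoted class value stays part of the <b ...> tag.
        for (String tag : findTags("<p>Hello <b class=\"x>y\">world</b></p>")) {
            System.out.println(tag);
        }
    }
}
```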

     That's it, guys. That is all for this regular expressions tutorial. Let me know your comments and suggestions about it. Thank you.
