Regular expressions are a way for programs to search for a particular pattern in a string. For example, they can be used to find strings starting with "www.", or to check whether some user input looks like a valid email address. Regular expressions are incredibly useful for validating user input and for rewriting URLs with the Apache server's mod_rewrite module.
It is worth noting that the phrase "regular expression" is often shortened to "regex" or "regexp".
Regular expressions, and close variants of them, are used in many different environments, including programming languages such as PHP and Java and the Apache web server.
As with most things, different environments are not always fully compatible with each other; however, the basic syntax and ideas discussed in this article are common to most.
The simplest regexes just match a literal piece of text. Literal characters include every character except the special characters discussed below. They have no special meaning and simply match that particular character.
For example, the regex cat will match cat in "About cats and dogs". Note that by default only the first occurrence of the pattern is matched, but most environments can also return all matches.
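As a minimal sketch of the difference between the first match and all matches, here is the same idea in Python's standard `re` module. Python is an assumption made for illustration (the article is not tied to one language), and the second sentence is a made-up example with two occurrences.

```python
import re

text = "About cats and dogs"

# re.search returns only the first occurrence of the pattern.
first = re.search(r"cat", text)
first_span = first.span()  # (start, end) position of the first match

# re.findall returns every non-overlapping occurrence.
# This sentence is a hypothetical example, not taken from the article.
all_matches = re.findall(r"cat", "cats chase other cats")

print(first_span)
print(all_matches)
```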
Special characters in regular expressions provide additional functionality to create more complex regular expressions. The most common special characters and their functions are described below:
| Character(s) | Function | Examples |
|---|---|---|
| `[` and `]` | Matches one out of several characters between the two square brackets. | `a[bc]d` matches abd and acd |
| `\` | Allows the following special character to be used as a literal character. | `\[hi\]` matches [hi] not h or i |
| `^` | Used within `[` and `]` to denote characters that are not allowed. At the start of a regex, denotes that the pattern should be matched at the start of the string. | `q[^u]` matches qa but not q or qu; `^cat` matches cats not scat |
| `$` | Denotes that the pattern should be matched at the end of the string. | `cat$` matches scat not cats |
| `.` | Matches any character, except new line characters. | `a.e` matches ace and ade |
| `\|` | Similar to `[` and `]` for single characters, but allows one of several regexes to be matched. Often used with `(` and `)` to group possible regexes. | `cat\|dog` matches cat or dog not fish |
| `?` | Makes the preceding part of the regex optional. | `colou?r` matches colour and color |
| `*` | Zero or more of the preceding part of the regex. | `be*n` matches bn and been |
| `+` | One or more of the preceding part of the regex. | `be+n` matches ben and been not bn |
| `(` and `)` | Groups part of a regex together, typically used with `\|` and `?` to give a choice of options or to make a section of the regex optional. | `(cat\|dog)s` matches cats and dogs |
| `{` and `}` | Limits the repetition of a part of a regex. Can be used as a replacement for `*` and `+`. | `A{3}` only matches AAA |
| `-` | Used within `[` and `]` to allow a range of values. | `a[b-y]z` matches agz not aaz |
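The examples in the table can be checked mechanically. The sketch below assumes Python's `re` module, whose syntax agrees with the table for these basic cases. The `check` helper is hypothetical, and a few of the "should not match" strings (aed, ae, colr, bin, fishs, AA) are added for illustration rather than taken from the table.

```python
import re

# Each entry: (pattern, strings that should match, strings that should not).
examples = [
    (r"a[bc]d", ["abd", "acd"], ["aed"]),
    (r"\[hi\]", ["[hi]"], ["h", "i"]),
    (r"q[^u]", ["qa"], ["q", "qu"]),
    (r"cat$", ["scat"], ["cats"]),
    (r"a.e", ["ace", "ade"], ["ae"]),
    (r"cat|dog", ["cat", "dog"], ["fish"]),
    (r"colou?r", ["colour", "color"], ["colr"]),
    (r"be*n", ["bn", "been"], ["bin"]),
    (r"be+n", ["ben", "been"], ["bn"]),
    (r"(cat|dog)s", ["cats", "dogs"], ["fishs"]),
    (r"A{3}", ["AAA"], ["AA"]),
    (r"a[b-y]z", ["agz"], ["aaz"]),
]


def check(pattern, should_match, should_not_match):
    """Return True if the pattern behaves as the table describes."""
    # re.search looks for the pattern anywhere in the string.
    matched = all(re.search(pattern, s) for s in should_match)
    rejected = not any(re.search(pattern, s) for s in should_not_match)
    return matched and rejected


results = {pattern: check(pattern, yes, no) for pattern, yes, no in examples}

for pattern, ok in results.items():
    print(pattern, "behaves as described" if ok else "differs from the table")
```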
Probably the easiest way to describe the usage of regexes is to use some examples, in the context of a data validation script.
First we shall look at validating a name (either first name or surname). The regex we will use is:
^[A-Za-z-]{2,50}$
Let's dissect it down into several parts:
* ^ and $ at the start and end of the regex require the entirety of the string to match.
* [A-Za-z-] matches all letters, both upper and lowercase. The hyphen is required at the end for names like Anne-Marie.
* {2,50} requires at least 2 and at most 50 characters. We're assuming people won't have names longer than 50 characters.

Matches: Sam, Anne-Marie
Does not match: Sam54, BOB!
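A minimal sketch of how this name check might be used in a validation script, assuming Python's `re` module. The function name `is_valid_name` is hypothetical.

```python
import re

# The name regex from the article.
NAME_RE = re.compile(r"^[A-Za-z-]{2,50}$")


def is_valid_name(name: str) -> bool:
    """Return True if the name is 2-50 letters or hyphens."""
    # fullmatch requires the whole string to match; in Python, $ on its
    # own would also accept a single trailing newline.
    return NAME_RE.fullmatch(name) is not None


for candidate in ["Sam", "Anne-Marie", "Sam54", "BOB!"]:
    print(candidate, is_valid_name(candidate))
```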
The next regex validates a date. It checks that the date is in the correct format, but not that it is a real date (for example, it accepts 31 February 2010). It matches dates in the forms DD/MM/YY, DD/MM/YYYY, DD-MM-YY and DD-MM-YYYY, and allows padding zeros to be left out, for example 9/2/10.
^((0?[1-9])|([1-2][0-9])|(3[0-1]))(/|-)((0?[1-9])|(1[0-2]))(/|-)(([0-9]{2})|((19|20)[0-9]{2}))$
Again, let's split it into several parts:
* ^ and $ at the start and end of the regex require the entirety of the string to match.
* ((0?[1-9])|([1-2][0-9])|(3[0-1])) allows the day of the month to be anywhere between 1 and 31, with or without a zero for padding.
* (/|-) between each group matches the forward slash or hyphen in the date.
* ((0?[1-9])|(1[0-2])) matches the month number, between 1 and 12, with or without the zero for padding.
* (([0-9]{2})|((19|20)[0-9]{2})) matches a 2 or 4 digit year. If the year has 4 digits, it must be in the form 19xx or 20xx.

Matches: 03/03/2010, 3/3/10, 03-03-1994, 3-3-94
Does not match: 34/02/2010, 02/25/2010, 30.10.2010
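The breakdown above can be tried out with a short sketch, again assuming Python's `re` module. The pattern is the one described in the list, written without spaces and split across lines only for readability. The name `is_date_format` is hypothetical.

```python
import re

DATE_RE = re.compile(
    r"^((0?[1-9])|([1-2][0-9])|(3[0-1]))"  # day: 1-31, optional leading zero
    r"(/|-)"                               # separator
    r"((0?[1-9])|(1[0-2]))"                # month: 1-12, optional leading zero
    r"(/|-)"                               # separator
    r"(([0-9]{2})|((19|20)[0-9]{2}))$"     # year: 2 digits, or 19xx / 20xx
)


def is_date_format(text: str) -> bool:
    """Check the format only; this does not verify the date exists."""
    return DATE_RE.fullmatch(text) is not None


for candidate in ["03/03/2010", "3-3-94", "34/02/2010", "30.10.2010"]:
    print(candidate, is_date_format(candidate))

# Correct format but not a real date: the regex still accepts it.
print("31/02/2010", is_date_format("31/02/2010"))
```

Because the regex only checks the format, a script that needs real dates would still have to verify the day against the month and year separately.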
Note: For use with American dates (MM/DD/YYYY general format), the following regex will work:
^((0?[1-9])|(1[0-2]))(/|-)((0?[1-9])|([1-2][0-9])|(3[0-1]))(/|-)(([0-9]{2})|((19|20)[0-9]{2}))$
Regexes to validate email addresses can become quite complex, so this particular example will not be explained in as much detail.
^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$
Matches: e@abc.com, bob.jones@my.business.co.uk
Does not match: .@eee.com, eee@e-.com, eee@eee.eeeeeeeeee
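As a rough sketch, the email regex can be checked against the examples above in Python. Two assumptions are made here: the pattern is split across two lines purely for readability, and the `re.ASCII` flag is passed so that `\w` only covers A-Z, a-z, 0-9 and the underscore (without it, Python's `\w` also accepts non-English letters). The name `is_valid_email` is hypothetical.

```python
import re

EMAIL_RE = re.compile(
    r"^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@"               # local part and @
    r"([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$",  # domain labels, TLD
    re.ASCII,
)


def is_valid_email(address: str) -> bool:
    """Return True if the address matches the article's email regex."""
    return EMAIL_RE.fullmatch(address) is not None


for candidate in ["e@abc.com", "bob.jones@my.business.co.uk",
                  ".@eee.com", "eee@e-.com", "eee@eee.eeeeeeeeee"]:
    print(candidate, is_valid_email(candidate))
```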
Regexes can be used to great effect for selecting files, validating data and rewriting URLs; pretty much any application of pattern matching. Regexes are well documented and many sites exist providing sample regexes for specific jobs.
Please note that regexes are often case sensitive and that different environments (PHP, Java etc.) may treat the same regex slightly differently.
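Case sensitivity is easy to see in a small sketch. This one assumes Python, where matching is case sensitive unless the `re.IGNORECASE` flag is passed; other environments use their own flags or modifiers for the same purpose.

```python
import re

text = "About CATS and dogs"

# By default the lowercase pattern does not match the uppercase text.
case_sensitive = re.search(r"cat", text)

# With the IGNORECASE flag, the same pattern matches regardless of case.
case_insensitive = re.search(r"cat", text, re.IGNORECASE)

print(case_sensitive)
print(case_insensitive.group())
```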
Comments
With PHP I find that the best regex to use for emails is:
This matches most email addresses, but doesn’t match a few like these, both of which are perfectly valid:
email@address.museum
email@99.198.122.146
However, not many email addresses use those syntaxes, and I haven't had a problem with them yet.