What are regular expressions?

Regular expressions are a way for programs to search for a particular pattern in a string. For example, they can be used to find strings starting with "www.", or to check whether some user input is a valid email address. Regular expressions are especially useful for validating user input and for writing rewrite rules with the mod_rewrite module of the Apache server.

It is worth noting that the phrase "regular expression" is often shortened to "regex" or "regexp".

Where are they used?

Regular expressions, and close variants of them, are used in many different environments, including programming languages such as PHP and Java, and the Apache server's mod_rewrite module.

As with most things, different environments are not always fully compatible with each other; however, the basic syntax and ideas discussed in this article are common to most.

Literal Characters

The simplest regexes just match a fixed piece of text. Literal characters are all characters except the special characters discussed below. They have no special meaning and simply match themselves.

For example, the regex cat will match cat in "About cats and dogs". Note that by default it only matches the first occurrence of the pattern, but it can be set to return all matches too.
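
As a minimal sketch of the difference between the first match and all matches, assuming Python's built-in re module (the article itself is not tied to any one language), re.search stops at the first occurrence while re.findall returns every one. The sample sentence is extended with a second "cats" purely for illustration.

```python
import re

# Hypothetical sample text: the article's sentence with a second "cats" added.
text = "About cats and dogs, and more cats"

# re.search returns only the first occurrence of the literal pattern.
first = re.search("cat", text)
print(first.group(), first.start())

# re.findall returns every non-overlapping occurrence as a list.
all_matches = re.findall("cat", text)
print(all_matches)
```

Other environments expose the same choice differently, for example through a "global" flag or a separate "match all" function.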

Special Characters

Special characters in regular expressions provide additional functionality to create more complex regular expressions. The most common special characters and their functions are described below:

Character(s) Function Examples

[ and ]

Matches one out of several characters between the two square brackets.

a[bc]d matches abd and acd

\

Allows the following special character to be used as a literal character.

\[hi\] matches [hi] not h or i

^

Used as the first character within [ and ] to denote characters that are not allowed.
or
Outside square brackets, denotes that the pattern should be matched at the start of the string.

q[^u] matches qa but not q or qu
or
^cat matches cat but not scat

$

Denotes that the pattern should be matched at the end of the string.

cat$ matches scat not cats

.

Matches any single character, except newline characters.

a.e matches ace and ade

|

Similar to [ and ] for single characters, but allows one of several regexes to be matched. Often used with ( and ) to group possible regexes.

cat|dog matches cat or dog not fish

?

Makes the preceding part of the regex optional.

colou?r matches colour and color

*

Zero or more of the preceding part of the regex.

be*n matches bn and been

+

One or more of the preceding part of the regex.

be+n matches ben and been not bn

( and )

Groups part of a regex together, typically used with | and ? to give choice of options or to make a section of the regex optional.

(cat|dog)s matches cats and dogs
or
lawn(mower)? matches lawn and lawnmower

{ and }

Limits the repetition of a part of a regex. Can be used as a replacement for * and +.
It has three uses:

  • {n} repeats the previous term exactly n times
  • {n,} repeats the previous term n or more times
  • {n,m} repeats the previous term between n and m times inclusive

A{3} only matches AAA
A{3,} matches AAA and AAAA etc
A{3,5} only matches AAA, AAAA and AAAAA

-

Used within [ and ] to allow a range of characters.

a[b-y]z matches agz not aaz
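
The examples in the table can be checked mechanically. The following is a rough sketch assuming Python's re module; the patterns and most strings come from the table, while the extra non-matching strings (AA, AAAA, AAAAAA) and the ^ and $ wrapped around the { and } examples are additions made here so that "only matches" can be tested.

```python
import re

# (pattern, strings that should match, strings that should not match)
TABLE_EXAMPLES = [
    (r"a[bc]d",       ["abd", "acd"],           []),
    (r"\[hi\]",       ["[hi]"],                 ["h", "i"]),
    (r"q[^u]",        ["qa"],                   ["q", "qu"]),
    (r"^cat",         ["cat"],                  ["scat"]),
    (r"cat$",         ["scat"],                 ["cats"]),
    (r"a.e",          ["ace", "ade"],           []),
    (r"cat|dog",      ["cat", "dog"],           ["fish"]),
    (r"colou?r",      ["colour", "color"],      []),
    (r"be*n",         ["bn", "been"],           []),
    (r"be+n",         ["ben", "been"],          ["bn"]),
    (r"(cat|dog)s",   ["cats", "dogs"],         []),
    (r"lawn(mower)?", ["lawn", "lawnmower"],    []),
    # Anchors added here: without them, A{3} would also be found inside AAAA.
    (r"^A{3}$",       ["AAA"],                  ["AA", "AAAA"]),
    (r"^A{3,}$",      ["AAA", "AAAA"],          ["AA"]),
    (r"^A{3,5}$",     ["AAA", "AAAA", "AAAAA"], ["AA", "AAAAAA"]),
    (r"a[b-y]z",      ["agz"],                  ["aaz"]),
]

def failures(examples):
    """Return (pattern, string) pairs that do not behave as the table says."""
    bad = []
    for pattern, should_match, should_not_match in examples:
        for s in should_match:
            if re.search(pattern, s) is None:
                bad.append((pattern, s))
        for s in should_not_match:
            if re.search(pattern, s) is not None:
                bad.append((pattern, s))
    return bad

print(failures(TABLE_EXAMPLES))  # an empty list means every example held
```

re.search is used rather than a whole-string match because the ^ and $ rows only make sense when the pattern is allowed to match anywhere in the string.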

Examples of Use

Probably the easiest way to describe the usage of regexes is to use some examples, in the context of a data validation script.

Names

First we shall look at validating a name (either first name or surname). The regex we will use is:

^[A-Za-z-]{2,50}$

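Let's break it down into its parts:

  • ^ anchors the match to the start of the string
  • [A-Za-z-] matches a single upper-case letter, lower-case letter or hyphen
  • {2,50} repeats that character class between 2 and 50 times
  • $ anchors the match to the end of the string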

Matches: Sam, Anne-Marie
Does not match: Sam54, BOB!
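
A minimal sketch of how this pattern might be used in a validation script, assuming Python's re module. The function name is_valid_name is hypothetical and not part of the original article.

```python
import re

# The name pattern from the article: 2 to 50 letters or hyphens.
NAME_RE = re.compile(r"^[A-Za-z-]{2,50}$")

def is_valid_name(name: str) -> bool:
    """Return True if the whole string is 2-50 letters or hyphens."""
    # fullmatch is used so that a trailing newline is rejected as well;
    # in Python, $ on its own would still match just before a final "\n".
    return NAME_RE.fullmatch(name) is not None

for candidate in ["Sam", "Anne-Marie", "Sam54", "BOB!"]:
    print(candidate, is_valid_name(candidate))
```

Note that this pattern only covers unaccented Latin letters, so names containing apostrophes, spaces or accented characters are rejected.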

Date

This checks that the date is in the correct format, but not that it is a real calendar date (e.g. 31 February 2010 is accepted). It matches dates in the forms DD/MM/YY, DD/MM/YYYY, DD-MM-YY and DD-MM-YYYY. It allows padding zeros to be left out, for example 9/2/10.

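^((0?[1-9])|([1-2][0-9])|(3[0-1]))(/|-)((0?[1-9])|(1[0-2]))(/|-)(([0-9]{2})|((19|20)[0-9]{2}))$

Again, let's split it into several parts:

  • ^ anchors the match to the start of the string
  • ((0?[1-9])|([1-2][0-9])|(3[0-1])) matches the day: 1 to 9 with an optional padding zero, 10 to 29, or 30 to 31
  • (/|-) matches the separator, either a slash or a hyphen
  • ((0?[1-9])|(1[0-2])) matches the month: 1 to 9 with an optional padding zero, or 10 to 12
  • (/|-) matches the second separator
  • (([0-9]{2})|((19|20)[0-9]{2})) matches the year: either any two digits, or four digits beginning with 19 or 20
  • $ anchors the match to the end of the string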

Matches: 03/03/2010, 3/3/10, 03-03-1994, 3-3-94
Does not match: 34/02/2010, 02/25/2010, 30.10.2010
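
The following sketch tries the day-first pattern against the examples above, assuming Python's re module. It assumes the pattern is written with no embedded spaces, with [1-2][0-9] as the class for days 10 to 29, and with an optional padding zero on the day. The name looks_like_date is hypothetical.

```python
import re

# Day-first date pattern, split across lines for readability only;
# Python joins adjacent string literals into one pattern.
DATE_RE = re.compile(
    r"^((0?[1-9])|([1-2][0-9])|(3[0-1]))"  # day: 1-31, padding zero optional
    r"(/|-)"                               # separator
    r"((0?[1-9])|(1[0-2]))"                # month: 1-12, padding zero optional
    r"(/|-)"                               # separator
    r"(([0-9]{2})|((19|20)[0-9]{2}))$"     # year: YY, or YYYY starting 19 or 20
)

def looks_like_date(text: str) -> bool:
    """Check the format only; impossible dates such as 31/02/2010 still pass."""
    return DATE_RE.fullmatch(text) is not None

for candidate in ["03/03/2010", "3/3/10", "34/02/2010", "30.10.2010", "31/02/2010"]:
    print(candidate, looks_like_date(candidate))
```

Because the two separators are matched independently, a mixed form such as 03/03-2010 is also accepted by this pattern.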

Note: For use with American dates (MM/DD/YYYY general format), the following regex will work:

^((0?[1-9])|(1[0-2]))(/|-)((0?[1-9])|([1-2][0-9])|(3[0-1]))(/|-)(([0-9]{2})|((19|20)[0-9]{2}))$
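
As a quick sketch of the month-first variant, again assuming Python's re module, with [1-2][0-9] as the class for days 10 to 29 and an optional padding zero on the day. The name looks_like_us_date is hypothetical.

```python
import re

# Month-first (American) date pattern: the month and day groups are swapped.
US_DATE_RE = re.compile(
    r"^((0?[1-9])|(1[0-2]))"               # month: 1-12
    r"(/|-)"
    r"((0?[1-9])|([1-2][0-9])|(3[0-1]))"   # day: 1-31
    r"(/|-)"
    r"(([0-9]{2})|((19|20)[0-9]{2}))$"     # year: YY, or YYYY starting 19 or 20
)

def looks_like_us_date(text: str) -> bool:
    """Check the MM/DD/YYYY-style format only, not calendar validity."""
    return US_DATE_RE.fullmatch(text) is not None

print(looks_like_us_date("02/25/2010"))  # month 02, day 25
print(looks_like_us_date("25/02/2010"))  # 25 is not a valid month
```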

Email

Regexes to validate email addresses can become quite complex, so this particular example will not be explained in as much detail.

^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$

Matches: e@abc.com, bob.jones@my.business.co.uk
Does not match: .@eee.com, eee@e-.com, eee@eee.eeeeeeeeee
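
A short sketch that runs the email pattern against the examples above, assuming Python's re module. The function name looks_like_email is hypothetical.

```python
import re

# The email pattern from the article, unchanged.
# In Python, \w also matches non-ASCII letters unless re.ASCII is passed.
EMAIL_RE = re.compile(
    r"^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*"
    r"@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$"
)

def looks_like_email(address: str) -> bool:
    """Return True if the address fits the article's email pattern."""
    return EMAIL_RE.fullmatch(address) is not None

samples = [
    "e@abc.com",
    "bob.jones@my.business.co.uk",
    ".@eee.com",
    "eee@e-.com",
    "eee@eee.eeeeeeeeee",
]
for address in samples:
    print(address, looks_like_email(address))
```

One consequence of the pattern worth knowing: each part of the domain before a dot needs at least two characters, so an address such as e@a.com is rejected even though it is well formed.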

Conclusion

Regexes can be used to great effect for selecting files, validating data and rewriting URLs; pretty much any application of pattern matching. Regexes are well documented, and many sites exist providing sample regexes for specific jobs.

Please note that regexes are often case sensitive, and that different environments (PHP, Java etc.) may treat the same regex slightly differently.
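
To illustrate the point about case sensitivity, here is a minimal sketch assuming Python's re module, where matching is case sensitive unless the re.IGNORECASE flag is passed. Other environments use their own switch for this, such as a trailing i modifier on the pattern.

```python
import re

text = "About CATS and dogs"  # hypothetical sample text

# Case sensitive by default: lower-case "cat" is not found in "CATS".
sensitive = re.search("cat", text)
print(sensitive)

# With the IGNORECASE flag the same pattern matches the upper-case text.
insensitive = re.search("cat", text, re.IGNORECASE)
print(insensitive.group())
```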

About Sam Haines:

Sam Haines is a web developer and has been making websites and experimenting with various web technologies for the past five years.


Tags: php regex

Comments

Callum Macrae says:

With PHP I find that the best regex to use for emails is:

/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i

This matches most email addresses, but doesn’t match a few like these, both of which are perfectly valid:

email@address.museum

email@99.198.122.146

However, not many email addresses use those syntaxes, and I haven't had a problem with them yet.
