Regular Expressions for the Rest of Us

By Landon Schropp on March 16, 2015

Sooner or later you'll run across a regular expression. With their cryptic syntax, confusing documentation and massive learning curve, most developers settle for copying and pasting them from StackOverflow and hoping they work. But what if you could decode regular expressions and harness their power? In this article, I'll show you why you should take a second look at regular expressions, and how you can use them in the real world.

Why Regular Expressions?

Why bother with regular expressions at all? Why should you care?

Matching: Regular expressions are great at determining if a string matches some format, such as a phone number, email or credit card number.
Replacement: Regular expressions make it easy to find and replace patterns in a string. For example, text.replace(/\s+/g, " ") replaces all chunks of whitespace in text, such as " \n\t ", with a single space.
Extraction: It's easy to extract pieces of information from a pattern with regular expressions. For example, name.match(/^(Mr|Ms|Mrs|Dr)\.?\s/i)[1] extracts a person's title from a string, such as "Mr" from "Mr. Schropp".
Portability: Almost every major language has a regular expression library. The syntax is mostly standardized, so you don't have to worry about relearning regexes when you switch languages.
Coding: When writing code, you can use regular expressions to search through files with tools such as find and replace in Atom or ack on the command line.
Clear and Concise: If you're comfortable with regular expressions, you can perform some pretty tricky operations with a very small amount of code.
Fame and Glory: Regular expressions will give you superpowers.
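To make the Replacement and Extraction bullets above a little more concrete, here's a rough sketch in plain JavaScript. The sample strings and the extractTitle helper are made up for illustration. The helper also guards against match returning null, which happens when the string has no title.

```javascript
// Replacement: collapse every run of whitespace into a single space.
const messy = "Regular \n\t expressions   rock";
const tidy = messy.replace(/\s+/g, " ");
console.log(tidy); // "Regular expressions rock"

// Extraction: pull the title out of a name (hypothetical helper).
function extractTitle(name) {
  const match = name.match(/^(Mr|Ms|Mrs|Dr)\.?\s/i);
  // match() returns null when nothing matches, so check before indexing.
  return match ? match[1] : null;
}

console.log(extractTitle("Mr. Schropp"));    // "Mr"
console.log(extractTitle("Landon Schropp")); // null
```

The parentheses in the pattern form a capture group, and index 1 of the match result holds whatever that group matched.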

How to Write Regular Expressions

The best way to learn regular expressions is by using an example. Let's say you're building a web page with a phone number input. Because you're a rockstar developer, you decide to display a checkmark when the phone number is valid and an X when it's invalid.

<input id="phone-number" type="text">
<label class="valid" for="phone-number"><img src="check.svg"></label>
<label class="invalid" for="phone-number"><img src="x.svg"></label>

input:not([data-validation="valid"]) ~ label.valid,
input:not([data-validation="invalid"]) ~ label.invalid {
  display: none;
}

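// Re-validate on every keystroke or paste ("input") and when focus leaves ("blur").
$("input").on("input blur", function(event) {
  // A valid number shows the checkmark right away.
  if (isPhoneNumber($(this).val())) {
    $(this).attr({ "data-validation": "valid" });
    return;
  }

  // Only flag the value as invalid once the user leaves the field.
  if (event.type === "blur") {
    $(this).attr({ "data-validation": "invalid" });
  }
  else {
    // While the user is still typing, show neither icon.
    $(this).removeAttr("data-validation");
  }
});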

With the above code, whenever a person types or pastes a valid number into the input, the check image is displayed. When the user blurs the input and the value is invalid, the error X is displayed.

Since you know that phone numbers are made up of ten digits, your first pass at isPhoneNumber looks like this:

function isPhoneNumber(string) {
  return /\d\d\d\d\d\d\d\d\d\d/.test(string);
}

This function contains a regular expression between the / characters with ten \d's, or digit characters. The test method returns true if the regex matches the string and false if it doesn't. If you run isPhoneNumber("5558675309"), it returns true! Woohoo!

However, writing ten \d's is a little redundant. Luckily, you can use curly braces to accomplish the same thing.

function isPhoneNumber(string) {
  return /\d{10}/.test(string);
}
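If you'd like to convince yourself that the two versions behave the same, a quick check along these lines should do it. The sample strings are arbitrary picks for illustration.

```javascript
// The longhand and shorthand patterns should agree on every input.
const longhand = /\d\d\d\d\d\d\d\d\d\d/;
const shorthand = /\d{10}/;

const samples = ["5558675309", "555867530", "not a number"];

for (const sample of samples) {
  console.log(sample, longhand.test(sample), shorthand.test(sample));
}
// 5558675309 true true
// 555867530 false false   (only nine digits)
// not a number false false
```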

Sometimes, when people type in phone numbers, they start with a leading 1. Wouldn't it be nice if your regex could handle those cases? You can with the ? character!

function isPhoneNumber(string) {
  return /1?\d{10}/.test(string);
}

The ? symbol means zero or one, so now isPhoneNumber returns true for both "5558675309" and "15558675309"!
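Here's a small sketch of the optional leading 1 in action. The variable name and the third sample string are just for illustration.

```javascript
// 1? means "an optional 1": zero or one occurrence of the character 1.
const withOptionalOne = /1?\d{10}/;

console.log(withOptionalOne.test("5558675309"));  // true  (no leading 1)
console.log(withOptionalOne.test("15558675309")); // true  (leading 1)
console.log(withOptionalOne.test("155586753"));   // false (too few digits either way)
```

The ? only applies to the single token right before it, which here is the 1.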

So far, isPhoneNumber is pretty good, but you're missing one key thing: regexes are more than happy to match parts of a string. As it stands, isPhoneNumber("555555555555555555") returns true because that string contains a run of ten digits somewhere inside it. You can fix this problem by using the ^ and $ anchors.

function isPhoneNumber(string) {
  return /^1?\d{10}$/.test(string);
}

Roughly, ^ matches the beginning of the string and $ matches the end, so now your regex will match the whole phone number.
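The sketch below compares the pattern with and without anchors. It also hints at why "roughly" is the right word: in JavaScript, adding the m (multiline) flag makes ^ and $ match at line breaks as well. The sample strings are made up for illustration.

```javascript
const unanchored = /1?\d{10}/;
const anchored = /^1?\d{10}$/;

// Eighteen digits: the unanchored pattern finds ten of them and is satisfied.
console.log(unanchored.test("555555555555555555")); // true
console.log(anchored.test("555555555555555555"));   // false
console.log(anchored.test("15558675309"));          // true

// With the m flag, ^ and $ also match at the start and end of each line.
const multiline = /^1?\d{10}$/m;
console.log(anchored.test("call me\n5558675309"));  // false
console.log(multiline.test("call me\n5558675309")); // true
```

For validating a single input value, leaving the m flag off is usually what you want.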

Getting Serious

You released your page, and it's a smashing success, but there's one major problem. In the U.S., there are many common ways to write a phone number:

(234) 567-8901
234-567-8901
234.567.8901
234/567-8901
234 567 8901
+1 (234) 567-8901
1-234-567-8901

While your users could leave out the punctuation, it's much easier for them to type out a formatted number.

While you could write a regular expression to handle all of those formats, it's probably a bad idea. Even if you nail every format in this list, it's very easy to miss one. Besides, you really only care about the data, not how it's formatted. So, instead of worrying about punctuation, why not strip it out?

function isPhoneNumber(string) {
  return /^1?\d{10}$/.test(string.replace(/\D/g, ""));
}

The replace function replaces every match of \D, which matches any non-digit character, with an empty string. The g, or global, flag tells the function to replace all matches of the regular expression instead of just the first.
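To see the stripping step on its own, here's a sketch that runs it over the formats listed earlier. The stripNonDigits helper is a hypothetical name used only for this example.

```javascript
const formats = [
  "(234) 567-8901",
  "234-567-8901",
  "234.567.8901",
  "234/567-8901",
  "234 567 8901",
  "+1 (234) 567-8901",
  "1-234-567-8901",
];

// Hypothetical helper: remove everything that isn't a digit.
function stripNonDigits(string) {
  return string.replace(/\D/g, "");
}

for (const format of formats) {
  console.log(format, "->", stripNonDigits(format));
}
// Every format ends up as either 2345678901 or 12345678901.

// Without the g flag, only the first non-digit is removed.
console.log("234-567-8901".replace(/\D/, "")); // "234567-8901"
```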

Getting Even More Serious

Everybody loves your phone number page, and you're the king of the water cooler at work. However, being the pro that you are, you want to take things one step further.

The North American Numbering Plan is the phone number standard used in the U.S., Canada, and twenty-three other countries. This system has a few simple rules:

A phone number ((234) 567-8901) is broken up into three pieces: The area code (234), the exchange code (567) and the subscriber number (8901).
For the area code and exchange code, the first digit can be 2 through 9 and the second and third digits can be 0 through 9.
The exchange code cannot have 1 as the third digit if 1 is also the second digit.

Your regex already works for the first rule, but it breaks the second and third. For now, let's only worry about the second rule. The new regular expression needs to look something like the following:

/^1?<AREA CODE><EXCHANGE CODE><SUBSCRIBER NUMBER>$/

The subscriber number is easy; it's four digits.

/^1?<AREA CODE><EXCHANGE CODE>\d{4}$/

The area code is a little trickier. You need a digit between 2 and 9, followed by two more digits. To accomplish that, you can use a character set! A character set lets you specify a group of characters to choose from.

/^1?[23456789]\d\d<EXCHANGE CODE>\d{4}$/

That's great, but it's annoying to type out all the characters between 2 and 9. Clean it up with a character range.

/^1?[2-9]\d\d<EXCHANGE CODE>\d{4}$/
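As a quick sanity check, the spelled-out set and the range should accept exactly the same characters. This sketch tests each single digit against both; the variable names are just for illustration.

```javascript
const characterSet = /^[23456789]$/;
const characterRange = /^[2-9]$/;

// Loop over each digit character and compare the two patterns.
for (const digit of "0123456789") {
  console.log(digit, characterSet.test(digit), characterRange.test(digit));
}
// 0 and 1 print false false; 2 through 9 print true true.
```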

That's better! Since the exchange code follows the same pattern as the area code, you could duplicate that part of your regex to finish off the number.

/^1?[2-9]\d\d[2-9]\d\d\d{4}$/

But wouldn't it be nice if you didn't have to copy and paste the area code section of your regex? You can simplify it by using a group! Groups are formed by wrapping characters in parentheses.

/^1?([2-9]\d\d){2}\d{4}$/

Now, [2-9]\d\d is contained in a group and {2} specifies that that group should occur twice.
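Here's a sketch comparing the duplicated version with the grouped one on a few digit strings. The samples are made up for illustration.

```javascript
const duplicated = /^1?[2-9]\d\d[2-9]\d\d\d{4}$/;
const grouped = /^1?([2-9]\d\d){2}\d{4}$/;

const samples = [
  "2345678901",  // valid
  "12345678901", // valid, with a leading 1
  "2341678901",  // exchange code starts with 1
  "1345678901",  // area code starts with 1
];

for (const sample of samples) {
  console.log(sample, duplicated.test(sample), grouped.test(sample));
}
// 2345678901 true true
// 12345678901 true true
// 2341678901 false false
// 1345678901 false false

// A repeated capture group only remembers its last repetition.
console.log("2345678901".match(grouped)[1]); // "567"
```

If you don't need the captured text at all, JavaScript also supports non-capturing groups, written (?:...), which group without remembering anything.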

That's it! Here's what the final isPhoneNumber function looks like:

function isPhoneNumber(string) {
  return /^1?([2-9]\d\d){2}\d{4}$/.test(string.replace(/\D/g, ""));
}
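Here's the final function run against a few inputs, plus one possible way to handle the third NANP rule. The isStrictPhoneNumber function is a hypothetical variant and not part of the page above. It assumes a negative lookahead, (?!\d11), is an acceptable way to reject exchange codes whose second and third digits are both 1.

```javascript
function isPhoneNumber(string) {
  return /^1?([2-9]\d\d){2}\d{4}$/.test(string.replace(/\D/g, ""));
}

// Hypothetical stricter variant: (?!\d11) refuses to continue when the
// next three digits (the exchange code) end in 11.
function isStrictPhoneNumber(string) {
  return /^1?[2-9]\d\d(?!\d11)[2-9]\d\d\d{4}$/.test(string.replace(/\D/g, ""));
}

console.log(isPhoneNumber("(234) 567-8901"));       // true
console.log(isPhoneNumber("1-234-567-8901"));       // true
console.log(isPhoneNumber("(123) 456-7890"));       // false (area code starts with 1)

console.log(isPhoneNumber("(234) 511-8901"));       // true  (rule three isn't checked)
console.log(isStrictPhoneNumber("(234) 511-8901")); // false
console.log(isStrictPhoneNumber("(234) 567-8901")); // true
```

The stricter pattern is already harder to read than the original, so it's worth asking whether the extra rule earns its keep.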

When to Avoid Regular Expressions

Regular expressions are great, but there are some problems you just shouldn't tackle with them.

Don't be too strict. There's little value in being too strict with regular expressions. For phone numbers, even if we did match all of the rules in NANP, there's still no way to know if a phone number is real. If I rattled off the number (555) 555-5555, it matches the pattern but it's not a real phone number.
Don't write an HTML parser. While it's fine to use regexes to parse simple things, they're not useful for parsing entire languages. Without getting too technical, you're not going to have a good time parsing non-regular languages with regular expressions.
Don't use them for really complicated strings. The full regex for emails is 6,318 characters long. A simple, imperfect one looks like this: /^[^@]+@[^@]+\.[^@\.]+$/. As a general rule of thumb, if your regular expression is longer than a line of code, it might be time to look for another solution.
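The simple email pattern above looks scarier than it is. Here's a rough piece-by-piece reading, followed by a few made-up addresses to try it on. Treat it as a sketch of what the pattern accepts, not as a recommendation for validating real email addresses.

```javascript
// ^        start of the string
// [^@]+    one or more characters that aren't @ (the part before the @)
// @        a literal @
// [^@]+    one or more characters that aren't @ (the domain, dots allowed)
// \.       a literal dot
// [^@\.]+  one or more characters that are neither @ nor a dot (the ending)
// $        end of the string
const simpleEmail = /^[^@]+@[^@]+\.[^@\.]+$/;

console.log(simpleEmail.test("landon@example.com")); // true
console.log(simpleEmail.test("user@example.co.uk")); // true  (the middle part may contain dots)
console.log(simpleEmail.test("no-at-sign.com"));     // false
console.log(simpleEmail.test("two@@example.com"));   // false
console.log(simpleEmail.test("user@localhost"));     // false (no dot after the @)
```

Inside square brackets, a leading ^ means "anything except", which is a different job from the ^ anchor at the start of the pattern.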

Wrapping Up

In this article, you've learned when to use regular expressions and when to avoid them, and you've experienced the process of writing one. Hopefully regular expressions seem a bit less ominous, and maybe even intriguing. If you use a regex to solve a tricky problem, let me know in the comments!

If you'd like to read more about regular expressions, check out the excellent MDN Regular Expressions Guide.

About Landon Schropp

Landon is a developer, designer and entrepreneur based in Kansas City. He's the author of Unraveling Flexbox. He's passionate about building simple apps people love to use.

Discussion

Fredrik
And then try to use the same regular expression on international phone numbers, postal codes etc. ;-)

Landon Schropp
Ha ha, that’s where it starts to get a little bit more complicated.

Šime Vidas
An (x) symbol is not a good choice as an indicator for invalid input, as that exact symbol is already widely used for clearing input fields. You even put it at the same spot (at the right edge inside the field).

Landon Schropp
Yep, I agree. I was just trying to keep the example light.

Cathy Mayhue
I remember, as a programmer, I wrote long pieces of code to get my search scripts right; the more parameters there were, the greater the complexity and the lower the speed. Learning regular expressions was the best decision of my life: it made the search scripts simpler and faster. You gave a very nice teaser of sorts for regular expressions.

Landon Schropp
Thanks!

Jspdown
If you want a nice regexp debugger: http://debuggex.com
It’s always helpful to get graphical feedback from a regexp.

Justice
Are regular expressions the same in all programming languages? For example, is regular expression syntax in PHP the same as in JavaScript or any other programming language?

Michael Ash
No, regular expressions are not the same across languages. The basic syntax is mostly the same, but feature support and advanced syntax vary widely.

Landon Schropp
Michael’s right. Different languages support different features, and sometimes the syntax is different. However, the core concepts are usually the same and should translate.

Hussain
@Landon you really made regex very easy. I bet that after going through this article, programmers are gonna use it no matter how slow or inefficient (in some cases) it is. I believe it’s a very good piece of programming art, particularly for the beginner or intermediate level, because the way you squeezed repetition into concise code with a perfect presentation is just a fine example of code reusability, the need for a new function (in case of code anomaly) and simplicity.

Landon Schropp
Thanks Hussain, I appreciate that. :)

Tony
Took me a few tries (and a few more when I discovered that the JS regex engine doesn’t support (?ifthen|else)), but I think I have an expression that covers all three of the NANP rules. It’s not too terribly lengthy, character-count-wise, but it’s a *fantastic* example of how quickly regex can go from clear and concise to hieroglyphics that get copy/pasted with fingers crossed. Here’s the expression:

Rituparna
Ohh my God! I have just found the thing that I was looking for for a long time. I can teach myself anything except this RegEx.

Thanks a ton :) I will practice thousands of times.

Ross
```
/^[^@]+@[^@]+\.[^@\.]+$/
```
I realise the above is just a quick & imperfect example for emails, but it would exclude domains like “.co.uk” which are very common in some countries, so maybe worth amending to be “too lax” rather than “too strict” to make it safe to use? The below should allow the two dot domains, though perhaps it’s simpler just to leave out “not .” in the group after the first dot match.
```
/^[^@]+@[^@]+\.[^@\.]+\.?[^@\.]*$/
```

Landon Schropp
Good call!

Nibin
At least I could understand the basics, thanks :)

Alan
Awesome tutorial but… That email regex… To quote Princess Bride: MY GOD WHAT IS THAT THING. Could someone please explain it?